Palczewska et al. Journal of Cheminformatics 2013, 5:16
http://www.jcheminf.com/content/5/1/16






RESEARCH ARTICLE Open Access 



Using Pareto points for model identification in 
predictive toxicology 

Anna Palczewska*, Daniel Neagu and Mick Ridley 



Abstract 

Predictive toxicology is concerned with the development of models that are able to predict the toxicity of chemicals. 
A reliable prediction of toxic effects of chemicals in living systems is highly desirable in cosmetics, drug design or food 
protection to speed up the process of chemical compound discovery while reducing the need for lab tests. There is 
an extensive literature associated with the best practice of model generation and data integration but management 
and automated identification of relevant models from available collections of models is still an open problem. 
Currently, the decision on which model should be used for a new chemical compound is left to users. This paper 
intends to initiate the discussion on automated model identification. We present an algorithm, based on Pareto 
optimality, which mines model collections and identifies a model that offers a reliable prediction for a new chemical 
compound. The performance of this new approach is verified for two endpoints: IGC50 and LogP. The results show a 
great potential for automated model identification methods in predictive toxicology. 
Background

Predictive toxicology is concerned with the development 
of models that are able to predict the toxicity of chemicals 
[1]. These models are continuously built and validated on 
large collections of toxicological experimental studies to 
discover new biologically active compounds that are more 
effective, selective, less toxic, or satisfy various toxicolog- 
ical criteria [2,3]. A reliable prediction of toxic effects of 
chemicals in living systems is highly desirable in domains 
such as: cosmetics, drug design or food safety. This knowl- 
edge allows an earlier rejection of those chemicals that 
may fail the testing phase and reduces the cost of manu- 
facturing chemical compounds in the development stage. 
Additionally, the European Commission's legislation on the
Registration, Evaluation and Authorization of Chemicals
(REACH) [4] allows the registration of chemicals that
were developed using in silico modelling, which facilitates
a reduction in the number of animal tests. These two factors
have contributed to increased interest from the research
and business communities in the development of toxicological
modelling systems that are focused on data integration,
model development and predictions (e.g. OpenTox [5],
InkSpot [6] or OCHEM [7]).

Quantitative Structure-Activity Relationship (QSAR) 
or Structure-Activity Relationship (SAR) models (both 
regression and classification) are the most common 
and widely used methods to relate chemical struc- 
ture/properties with their biological, chemical or environ- 
mental activities [8]. According to the Organisation for 
Economic Co-operation and Development (OECD) Prin- 
ciples for QSAR Model Validation [9], a model should 
be statistically significant and robust, have its application 
boundaries defined and be validated by an external dataset 
[10,11]. A model applicability domain [12,13] determines 
the boundary of the chemical sub-space where the model 
makes reliable prediction for a given activity. Applying 
models for chemicals from outside of their applicability 
domains increases the likelihood of inaccurate prediction. 

There is an extensive literature associated with the best 
practice of model generation and data integration [14-19] 
but management and identification of relevant models 
from available collections of models is still an open prob- 
lem. In recent years a large number of highly predic- 
tive models, having various applicability domains, has 
become publicly available. Some of them, tested on a wide 
chemical space, have become officially approved tools, 
e.g. KOWWIN (estimates the log octanol-water partition
coefficient) or BCFBAF (estimates fish bioconcentration 
factor) built into Estimation Program Interface (EPI) Suite 
[20]. There is also a large number of quality models that 
are applicable only for a narrow chemical space. Some 
of them are annotated according to the OECD princi- 
ples and publicly available in databases like JRC QSAR 
Models Database [21]. This database includes reports of 
model generation, validation and prediction according to 
the OECD standards. The QSAR Model Reporting Format
(QMRF) and the QSAR Prediction Reporting Format (QPRF)
have been developed at the Computational Toxicology
and Modelling lab of the JRC's Institute to standardise
annotation of model meta-information. Currently, considerable
effort is being devoted to building ontologies for QSAR experiments
and to providing an interoperable and reproducible
framework for QSAR analyses [22].

Models that are stored in model databases can be reused 
to predict toxicity of new chemical compounds. Unfortu- 
nately, this involves a manual process of model identifica- 
tion. A potential user is required to make a comparison 
of model applicability domains and their predictivity for a 
given activity in order to decide if the model can make reli- 
able predictions for a given chemical compound. Model 
comparison is a difficult task since models are gener- 
ated using various subsets or various chemical compound 
descriptors. Consequently, models can be trained and validated
on different datasets. For regression models, the
model performance can be described by the predictive
squared correlation coefficient $q^2$. Since the sizes and
contents of modelling and validation datasets may differ
for various models, the value of $q^2$ is not sufficient
for model comparison [10]. Several model performance
metrics were analysed in the context of model validation
and model selection [14]. They are applied in automated
model development where models are validated by the 
same dataset. In the case where two models come from 
different sources, model comparison becomes challeng- 
ing. This requires predictive models to be validated across 
the entire chemical space, which is very difficult as the list 
of available chemicals and assays may be limited. 

Clearly, there is a need for automated techniques for 
mining model repositories. This includes methods for 
model quality control, data and model integration, model 
comparison and model identification. Our research aims 
to address this gap. In this paper, we draw attention to the 
importance of existing models' usage in predictive tox- 
icology. We also introduce methods for effective model 
identification for a new unseen chemical compound. The 
term "model identification" covers the whole range of 
problems related to model selection from a collection 
of models (for a given endpoint) developed on various 
datasets. In the extreme case, datasets (and specified 
applicability domains) for two models can be disjoint. 



Model identification is a much harder problem than the
well known model selection problem [23], i.e. choosing
a model from a set of candidate models with the
same applicability domains. Therefore, various methods
applied in traditional model selection [24-27] cannot be 
directly applied to model identification. In contrast to 
model selection, model identification cannot take into 
account model variables or parameters since some model 
variables cannot be easily accessed for new chemical 
compounds. 

The interesting questions here are whether efficient
model identification is possible based on molecular structures
and model performances, and how good the identified
model can be for a new chemical compound. In
[28], the authors defined a framework for automated model
selection and described a simple algorithm for model 
selection. The method selects the most predictive model 
from the collection of models for a nearest neighbour to 
the query chemical compound. Often, the nearest neigh- 
bourhood can contain more than one element and model 
performances can differ slightly. In this case, it is difficult 
to say which model would be the most reliable for a given 
chemical compound. 

To answer the above questions, in this paper we present
a new method for model identification for regression
models. This method uses Pareto points [29] to define
the nearest Pareto neighbourhood according to two criteria:
structural similarity of chemicals and model performance.
In the next section a framework for model
identification, Pareto points and their properties are introduced.
Having the Pareto nearest neighbourhood defined,
we present two methods for model identification. The
first method averages model performances for all Pareto
neighbours and identifies the one with the smallest error.
The second method identifies a model for which the
Pareto point is the closest (based on Euclidean distance) to
a centroid of all points in the Pareto neighbourhood. We
also demonstrate that model identification improves the
quality of prediction for the test set, i.e. for unseen chemical
compounds. Experimental work using IGC50 for Tetrahymena
pyriformis and internal Syngenta LogP datasets shows that
our approach provides good results and is worth
considering for further research.

Methods 

Framework for model identification in predictive 
toxicology 

There are several chemical compound representations
and thousands of available chemical descriptors [8] used
for predictive model development. In this paper, a chemical
space $X$ is a set of chemicals represented by pairs
$x = (x_d, x_f)$, where $x_d \in \mathbb{R}^{K_1}$ represents a vector of
descriptor values, $x_f \in \{0, 1\}^{K_2}$ is a fingerprint, and $K_1 + K_2$
is the dimension of the chemical space. Descriptors
represent various topological, geometrical, physical and
chemical properties of a chemical compound. A fingerprint
is a binary vector whose coordinates define the
presence or absence of predefined structural fragments
within a molecule [30]. A fingerprint is also a one-dimensional
representation of a chemical compound and it
is widely used for chemical similarity search in large
databases [31]. It is also worth noting that a fingerprint
is not a unique chemical compound representation
because it encodes only fragments of a molecule:
two different molecules can have the same
fingerprint representation.

A predictive model $M$ is a mapping $M : X \to Y$, where
$Y \subset \mathbb{R}$ is the output space. The output space $Y$ might,
for example, represent a particular biological, physical or
chemical activity of a chemical compound.

The input data is represented by the pairs $(x_i, y_i) \in X \times Y$
for $i = 1, \dots, n$, where $x_i$ is an element of the
chemical space and $y_i$ is the measured activity of that element.
There is also a set of $m$ predictive models
$\mathcal{M} = \{M_1, \dots, M_m\}$ associated with the activity $Y$. These models
were generated using various statistical or data mining
techniques and they have different applicability domains
and performances. To identify the most predictive model
from the collection of models $\mathcal{M}$ for a new chemical
compound $x$, we define a partitioning model that splits
the chemical space into disjoint groups and allows an
unambiguous model identification.

A partitioning model $M$ is a mapping $X \to Y$ given by
the following formula:

$$
M(x) =
\begin{cases}
M_1(x), & x \in D_1, \\
M_2(x), & x \in D_2, \\
\quad\vdots \\
M_m(x), & x \in D_m,
\end{cases}
$$

where

• $D_1, \dots, D_m \subset X$ are disjoint,

The main hypothesis in predictive modelling is that similar
chemical compounds have similar properties [32].
Following this hypothesis we build the partitioning model
so that it splits the chemical space into groups in order to
maximize the similarity of their chemical compounds and
to minimize the error of the model associated with each
group. This is a bi-criteria problem
and the solutions have to represent a trade-off between
these criteria (the so-called Pareto points).
Pareto optimality is a multi-criteria optimisation technique
widely applied in decision making problems [29].
In QSAR modelling, multi-objective optimisation has been used
for feature selection [33] in order to maximize predictive
capacity and to reduce the number of selected descriptors.
In this paper we present how Pareto optimality can
be applied in QSAR model identification. In the following
sections we recall the basic definition of the Pareto set and
we propose an algorithm that finds Pareto points in 2D
vector space.

Pareto points and their properties

Let us consider a vector $v = [f_1, f_2, \dots, f_K]$ in the $K$-dimensional
space. Let $\pi_j(v) = f_j$ denote the $j$-th coordinate
of vector $v$ and let $V$ be a finite set of vectors in $\mathbb{R}^K$.

Definition 1 (Domination). A vector $v \in \mathbb{R}^K$ is dominated
by a vector $w \in \mathbb{R}^K$, which is denoted by $v \preceq w$, if

$$\pi_j(v) \le \pi_j(w), \quad \forall j = 1, \dots, K. \tag{1}$$

We say that $v$ is strictly dominated by $w$ ($v \prec w$), if $v \preceq w$
and $v \ne w$, i.e.

$$\forall_{j=1,\dots,K}\ \pi_j(v) \le \pi_j(w), \qquad \exists_{j=1,\dots,K}\ \pi_j(v) < \pi_j(w). \tag{2}$$

Definition 2 (Comparison). Vectors $v, w \in \mathbb{R}^K$ are incomparable,
which we denote by $v \sim w$, if neither $v \preceq w$ nor
$w \preceq v$.

Note that $v \sim w$ if and only if there exist $i, j \in \{1, \dots, K\}$,
$i \ne j$, such that

$$\pi_i(v) < \pi_i(w) \quad \text{and} \quad \pi_j(v) > \pi_j(w). \tag{3}$$

Definition 3 (Pareto set). A set $\Gamma \subset V$ of minimal vectors
with respect to $\preceq$ is called a Pareto set for $V$.

Note that $\Gamma$ consists of incomparable vectors. We can
define $\Gamma$ equivalently by the formula

$$\Gamma = \{v \in V : \forall_{w \in V}\ v \preceq w \vee v \sim w\}. \tag{4}$$

The above definitions and basic properties of the Pareto
set can be found in [34]. We now introduce some
properties of Pareto sets and the Pareto order that are used in
the following sections. First, we introduce some convenient
notation. Let

$$f_j^{\min} := \min\{\pi_j(v) : v \in V\}, \quad j = 1, \dots, K, \tag{5}$$

and

$$V_j := \{v \in V : \pi_j(v) = f_j^{\min}\}, \quad j = 1, \dots, K. \tag{6}$$



The set $V_j$ consists of all vectors in $V$ with minimal value
on the $j$-th coordinate.

Lemma 1. Let $\Gamma_j$ be the set of all minimal vectors in $V_j$.
Then $\Gamma_j \subset \Gamma$, where $\Gamma$ is the Pareto set for $V$.

Let $I\Gamma = \bigcup_{j=1,\dots,K} \Gamma_j$ and

$$f_j^{\max} := \max\{\pi_j(v) : v \in I\Gamma\}, \quad j = 1, \dots, K. \tag{7}$$

In particular, $I\Gamma$ is a subset of $\Gamma$ and it is called an initial
Pareto set. Now we establish the conditions
for incomparability with vectors in this initial Pareto
set.

Lemma 2. If a vector $v \in V$ is incomparable with all vectors
in $I\Gamma$, then there exist at least two indices $j \in \{1, \dots, K\}$
such that

$$\pi_j(v) \in (f_j^{\min}, f_j^{\max}). \tag{8}$$

The proofs of Lemma 1 and Lemma 2, as well as of all
other results in the paper, are provided in Appendix 1.
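
As an illustration only, Definitions 1-3 and the initial Pareto set of Lemma 1 can be written as a brute-force Python sketch. The function names (`weakly_below`, `pareto_set`, `initial_pareto_set`) are ours and are not part of the original method; we also assume that duplicate vectors are merged before the order is applied.

```python
from typing import Iterable, List, Tuple

Vector = Tuple[float, ...]


def weakly_below(v: Vector, w: Vector) -> bool:
    """v is dominated by w in the sense of Eq. (1): every coordinate of v is <= that of w."""
    return all(a <= b for a, b in zip(v, w))


def strictly_below(v: Vector, w: Vector) -> bool:
    """Strict domination, Eq. (2): weakly below and not equal."""
    return v != w and weakly_below(v, w)


def incomparable(v: Vector, w: Vector) -> bool:
    """Definition 2: neither vector is weakly below the other."""
    return not weakly_below(v, w) and not weakly_below(w, v)


def pareto_set(vectors: Iterable[Vector]) -> List[Vector]:
    """Definition 3 by brute force (quadratic time): keep the minimal vectors."""
    pts = list(dict.fromkeys(vectors))  # merge duplicates, keep input order
    return [v for v in pts if not any(strictly_below(w, v) for w in pts)]


def initial_pareto_set(vectors: Iterable[Vector]) -> List[Vector]:
    """Union over j of the minimal vectors among those attaining the minimum on coordinate j."""
    pts = list(dict.fromkeys(vectors))
    if not pts:
        return []
    result: List[Vector] = []
    for j in range(len(pts[0])):
        f_min = min(v[j] for v in pts)              # Eq. (5)
        v_j = [v for v in pts if v[j] == f_min]     # Eq. (6)
        for v in pareto_set(v_j):
            if v not in result:
                result.append(v)
    return result
```

For the points (1, 5), (2, 3), (4, 1), (3, 4), (5, 5) and (1, 6), `pareto_set` returns (1, 5), (2, 3) and (4, 1), while `initial_pareto_set` returns only the two extreme points (1, 5) and (4, 1).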

Pareto order in two dimensions 

This subsection is devoted to the study of the two-dimensional
case, i.e. $K = 2$. We shall use the notation
introduced above.

Lemma 3. The set $I\Gamma$ has at most two elements.

1. If $|I\Gamma| = 1$, then $I\Gamma$ is the Pareto set for $V$.

2. If $|I\Gamma| = 2$, then a vector $v \in V$ is incomparable with
vectors in $I\Gamma$ if and only if

$$\forall_{j=1,2}\ \pi_j(v) \in (f_j^{\min}, f_j^{\max}). \tag{9}$$

As shown in Figure 1 and Figure 2, when $I\Gamma$ consists
of two elements $w_1$ and $w_2$, the set of vectors incomparable
with $I\Gamma$ is given by the rectangle $\mathcal{V}$. Let $\gamma$ be a vector
incomparable with $I\Gamma$, i.e. $\gamma \in \mathcal{V}$. The introduction of $\gamma$
divides the rectangle $\mathcal{V}$ into three areas:

• $A'$ and $A''$ are sets of vectors incomparable with $I\Gamma \cup \{\gamma\}$,

• $B$ is a set of vectors smaller than $\gamma$,

• $C$ is a set of vectors bigger than $\gamma$.



Figure 1 A space $\mathcal{V}$ of incomparable vectors bounded by
coordinates of vectors $w_1, w_2 \in I\Gamma$.

Figure 2 A partition of space $\mathcal{V}$ when a new vector $\gamma$ is
introduced.



The above properties of $I\Gamma$ and of vectors incomparable
with $I\Gamma$ allow us to limit the search space $\mathcal{V}$ to find Pareto
solutions.

Finding a Pareto set in 2D vector space 

In this section, we present an algorithm for finding a
Pareto set in two-dimensional space (see Algorithm 1).
FIND-PARETO-SET($V$) is a recursive algorithm that finds
all Pareto points in the rectangle $\mathcal{V}$ defined by two points
in the initial Pareto set $I\Gamma$ (see Lemma 1); this rectangle
contains all points from $V$. The algorithm starts by
finding a point $\gamma$ that does not dominate any other point
in $V$ (line 4). This point splits the area $\mathcal{V}$ into four rectangles
(see Figure 2). According to Lemma 2 and Lemma 3,
$B \cap V = \emptyset$, $C$ does not contain Pareto points, whereas
points in rectangles $A'$ and $A''$ are incomparable with $\gamma$.
The above procedure is recursively repeated for $V \cap A'$ and
$V \cap A''$.



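Algorithm 1 FIND-PARETO-SET($V$)
1: if $V = \emptyset$ then
2:     return $\emptyset$
3: end if
4: $\gamma \leftarrow$ FIND-PARETO-POINT($V$)
5: $Q_1 = (V \setminus \{\gamma\}) \cap ((-\infty, f_1(\gamma)] \times [f_2(\gamma), \infty))$
6: $Q_2 = (V \setminus \{\gamma\}) \cap ([f_1(\gamma), \infty) \times (-\infty, f_2(\gamma)])$
7: $\Gamma = \{\gamma\} \cup$ FIND-PARETO-SET($Q_1$) $\cup$ FIND-PARETO-SET($Q_2$)
8: return $\Gamma$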



The algorithm sketched above calls FIND-PARETO-POINT($V$)
(see Algorithm 2) to find a Pareto point in the
set $V$. This procedure has a worst-case running time of $O(n^2)$,
where $n$ is the number of elements in $V$ (when all solutions
are comparable, i.e. they form a chain, it may take $n$
iterations to find a Pareto point). However, the expected
running time is much shorter thanks to the random
selection of points.



Algorithm 2 FIND-PARETO-POINT($V$)
1: if $V = \emptyset$ then
2:     return $\emptyset$
3: end if
4: select $v$ randomly from $V$
5: while $v$ dominates points from $V \setminus \{v\}$ do
6:     $V \leftarrow \{v' \in V \setminus \{v\} : v' \prec v\}$
7:     select $v$ randomly from $V$
8: end while
9: return $v$
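
Algorithms 1 and 2 can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: points are plain `(f1, f2)` tuples, duplicates are merged first, and the two quadrants use strict inequalities (open boundaries) so that points sharing a coordinate with $\gamma$ but lying above or to the right of it are never returned. The function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def strictly_below(v: Point, w: Point) -> bool:
    """v is strictly smaller than w in the Pareto order."""
    return v != w and v[0] <= w[0] and v[1] <= w[1]


def find_pareto_point(points: Sequence[Point], rng: random.Random) -> Optional[Point]:
    """Random descent: move to a strictly smaller point until none exists."""
    if not points:
        return None
    candidates = list(points)
    v = rng.choice(candidates)
    while True:
        smaller = [u for u in candidates if strictly_below(u, v)]
        if not smaller:
            return v
        candidates = smaller
        v = rng.choice(candidates)


def find_pareto_set(points: Sequence[Point], rng: Optional[random.Random] = None) -> List[Point]:
    """Recursive 2D Pareto set; the result is ordered by increasing first coordinate."""
    rng = rng or random.Random(0)
    pts = list(set(points))  # assumption: duplicates are merged
    if not pts:
        return []
    g = find_pareto_point(pts, rng)
    # Upper-left and lower-right quadrants relative to g (open boundaries, our assumption).
    q1 = [p for p in pts if p[0] < g[0] and p[1] > g[1]]
    q2 = [p for p in pts if p[0] > g[0] and p[1] < g[1]]
    return find_pareto_set(q1, rng) + [g] + find_pareto_set(q2, rng)
```

Because the left quadrant is resolved before $\gamma$ and the right quadrant after it, the returned list does not depend on which random point is drawn first.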



Model identification in predictive toxicology 

Following the similarity hypothesis, researchers build
models for groups of chemicals that have a common
molecular fragment or common properties. These models
are more reliable and give better predictions for chemicals
that lie in the model applicability domains. Further,
high quality models developed for a small subset of chemical
space can be combined into a global model that covers
a larger chemical space using various ensemble techniques.
In this section we present how to identify a reliable model
from a collection of already existing models for new, previously
unseen chemicals.

The chemical space $X$ is a set of chemical compounds
represented by the combination of all possible existing
chemical descriptors, and for a given endpoint there is a
collection of existing models $\mathcal{M}$. For each chemical compound
$x \in X$, model predictions $Y' = \{y'_1, \dots, y'_m\}$ for
models from $\mathcal{M}$ are known (see Figure 3). To identify a
model for a given query chemical compound $q$ we convert
the set of chemicals from $X$ and their model performances
into a set of pairs $(d_i, e_{im})$, where $d_i$ represents the distance
between $q$ and the $i$-th chemical compound from
the chemical space. The error $e_{im} = |y(x_i) - y'_m(x_i)|$
defines the model performance for the $m$-th model from
$\mathcal{M}$ and for the $i$-th chemical compound. In a set of such
pairs, one can find models that have a low predictive
power for the most similar chemical compounds whereas
others give better predictions. This illustrates the situation
often encountered in multicriteria optimization
problems: there is no solution that outperforms the others
with respect to all criteria. Hence, instead of having one
solution we have a set of solutions that cannot be compared
to each other. The above task is a Pareto problem:
one has to balance similarity to existing chemical compounds
and correctness of predictions offered by available
models.

The model identification procedure (see Algorithm 3)
can be described as follows: for a query chemical
compound $q$ and a given chemical space, 1) create the set
$V$ of pairs $(d_i, e_{im})$, 2) find the Pareto set for $V$, 3) select
the most suitable model for $q$. To create the set $V$ we start
from the array $T$ (see Figure 3) that contains a structural
representation of each chemical compound, its measured
activity (for a given endpoint) and the predictive performance
of each model from $\mathcal{M}$.



Algorithm 3 MODEL-IDENTIFY($T$, $q$)
1: $V \leftarrow$ INIT($T$, $q$)
2: $\Gamma \leftarrow$ FIND-PARETO-SET($V$)
3: if $|\Gamma| = 1$ then
4:     return modelId of the sole element of $\Gamma$
5: else
6:     return FIND-MODEL-ID($\Gamma$)
7: end if



When MODEL-IDENTIFY($T$, $q$) is executed, in line 1 the
array $T$ is converted into a list of vectors $V$ using procedure
INIT($T$, $q$) (see Algorithm 4). Every vector $v \in V$
is defined as a pair consisting of the distance between $q$ and the
$i$-th chemical compound from $T$, and the error of the
$j$-th model from $\mathcal{M}$ for the compound $i$. The distance
$d_{qi} = 1 - ST_{qi}$ is calculated using the Tanimoto coefficient
$ST$, which is the most frequently used similarity measure
in chemoinformatics [35]. This coefficient works with
fingerprints (binary representations of molecules) and is
defined as the ratio between the number of bits set in
both fingerprints and the number of bits set in at least
one of them. The model error $e_{ij}$ is defined as the
distance between the true activity for compound $i$ and the
value computed by model $M_j$. We treat $V$ as the set of all possible
solutions for model identification for a given query
molecule $q$ and known chemical sub-space.
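
The Tanimoto-based distance $d_{qi} = 1 - ST_{qi}$ can be illustrated with a short Python sketch. Fingerprints are assumed here to be equal-length sequences of 0/1 values, and the convention that two all-zero fingerprints have similarity 1 is our own choice rather than something stated in the text.

```python
from typing import Sequence


def tanimoto_similarity(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """Bits set in both fingerprints divided by bits set in at least one of them."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Assumption: two empty fingerprints are treated as identical.
    return both / either if either else 1.0


def tanimoto_distance(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """Distance used for the first Pareto criterion: 1 - Tanimoto similarity."""
    return 1.0 - tanimoto_similarity(fp_a, fp_b)
```

For example, the fingerprints `[1, 1, 0, 1, 0]` and `[1, 0, 0, 1, 1]` share two set bits out of four set in either, giving a similarity of 0.5 and a distance of 0.5.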



Algorithm 4 INIT($T$, $q$)
1: $V \leftarrow \emptyset$
2: for $i = 0$ to rows($T$) do
3:     for $j = 0$ to models($T$) do
4:         calculate the distance $d_{qi}$ and error $e_{ij}$
5:         $V = V \cup \{(d_{qi}, e_{ij})\}$
6:     end for
7: end for
8: return $V$
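
A possible Python rendering of INIT is sketched below. The row layout `(fingerprint, measured activity, list of model predictions)` is an assumption made for the example, and each pair is extended with the compound and model indices so that the model can be recovered later; neither detail is prescribed by Algorithm 4.

```python
from typing import List, Sequence, Tuple

# Hypothetical row layout: (fingerprint bits, measured activity, predictions of each model)
Row = Tuple[Sequence[int], float, Sequence[float]]
# (distance to query, absolute model error, compound index, model index)
Pair = Tuple[float, float, int, int]


def tanimoto_distance(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """1 - Tanimoto similarity of two equal-length binary fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return 1.0 - (both / either if either else 1.0)


def init_pairs(table: Sequence[Row], query_fp: Sequence[int]) -> List[Pair]:
    """Build the set V: one (distance, error) pair per compound and model."""
    pairs: List[Pair] = []
    for i, (fp, measured, predictions) in enumerate(table):
        d_qi = tanimoto_distance(query_fp, fp)      # same distance for every model of compound i
        for j, predicted in enumerate(predictions):
            e_ij = abs(measured - predicted)        # absolute prediction error of model j
            pairs.append((d_qi, e_ij, i, j))
    return pairs
```

With two compounds and two models the function returns four pairs; the first two share the distance of the first compound and differ only in the model error.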



Row ID  CAS       NAME                             Log.1.I...  (illegible)  lr1     lr2     lr3    lr4    mlrNPN  mlrPN
749     95749     3-Chloro-4-methylaniline         0.39        -0.169       -0.556  -0.328  0.215  0.144  -1.172  -0.329
750     87605     3-Chloro-2-methylaniline         0.38        -0.169       -0.556  -0.328  0.215  0.144  -1.172  -0.329
751     6627550   2-Bromo-4-methylphenol           0.6         0.505        0.128   0.338   0.838  0.669  -0.51   0.166
752     16532799  4-Bromophenyl acetonitrile       0.6         0.333        -0.154  0.032   0.827  0.598  -1.189  -0.342
753     95885     4-Chlororesorcinol               0.13        -0.094       -0.469  -0.24   0.27   0.196  -1.048  -0.237
754     621590    3-Hydroxy-4-methoxybenzaldehyde  -0.14       -0.22        -0.683  -0.474  0.273  0.149  -1.581  -0.635
755     3218368   4-Biphenylcarboxaldehyde         1.12        0.789        0.535   0.77    0.934  0.819  0.336   0.797

(The SMILES column, rendered as structure drawings in the original figure, is not reproduced.)

Figure 3 Collection of models for the IGC50 prediction for Tetrahymena pyriformis. The first three columns include chemical compound
representation. The fourth column represents the measured value of IGC50. The presentation of model predictions starts from the fifth column.

In line 2 of MODEL-IDENTIFY($T$, $q$), we call FIND-PARETO-SET($V$)
to find the set $\Gamma$ of all Pareto points in $V$.
Then, we analyse points in $\Gamma$ in order to choose the most
predictive model for $q$. In the case when $|\Gamma| = 1$, there
is only one candidate, so the choice is trivial. This case
is comparable to the algorithm proposed in [28], which
selects the most predictive model for the most similar
chemical compound of $q$. In the case when $\Gamma$ consists of
many Pareto points, the model identification becomes a
difficult task: the Tanimoto similarity coefficient (as well
as other fingerprint similarity measures) between chemical
compounds may not be correlated enough with their
activity, partially contradicting the similarity hypothesis
[32] (see the end of this section for a detailed example).
To identify a model using Pareto points, first we define
the n-Pareto Neighbourhood as follows:

Definition 4. An n-Pareto Neighbourhood is a set of at
most $n$ Pareto points from $\Gamma$ which are at a distance less
than $\tau$ from the element $q$, where $\tau > 0$ and $n > 0$.

The threshold $\tau$ is selected by experiment and depends
on the chemical similarity within a given chemical space.
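
Definition 4 can be sketched in Python as below. The definition does not say which points are kept when more than $n$ Pareto points lie within the threshold; keeping the $n$ points nearest to $q$ is our assumption, as is the tuple layout `(distance, error, model id)`.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout of a Pareto point: (distance to q, model error, model id)
ParetoPoint = Tuple[float, float, str]


def pareto_neighbourhood(pareto_points: Sequence[ParetoPoint], n: int, tau: float) -> List[ParetoPoint]:
    """At most n Pareto points whose distance to the query is below tau."""
    if n <= 0 or tau <= 0:
        raise ValueError("n and tau must be positive")
    close = [p for p in pareto_points if p[0] < tau]
    # Assumption: when more than n points qualify, the nearest ones are kept.
    close.sort(key=lambda p: p[0])
    return close[:n]
```

A small threshold removes distant compounds even if some model predicts them very well, which is the role the text assigns to $\tau$.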



Having defined the Pareto neighbourhood for a given
chemical compound $q$, we provide two methods for model
identification. The first one is called n-Average Pareto (see
Algorithm 5). The threshold $\tau$ provides a means of removing
those chemical compounds which are dissimilar to
the query compound $q$ but whose activity is very well predicted
by some model. Next, the model errors are averaged
over the chemicals represented by the Pareto points, and
the model with the smallest average error is selected.
We call this method n-Average Pareto Model Identification
(n-APMI). The advantage of the Pareto neighbourhood
over the standard nearest neighbourhood is
that this method is more sensitive to model performance
and allows for the rejection of similar chemical compounds
on which models perform badly.

Algorithm 5 Average Pareto
FIND-MODEL-ID($\Gamma$, $T$, $n$, $\tau$)
1: n-PN $\leftarrow$ n-Pareto neighbourhood for a given $n$ and the threshold $\tau$
2: $X' \leftarrow$ all chemical compounds linked to points in n-PN (use $T$ to accomplish this task)
3: compute for each model the average error on chemical compounds from $X'$
4: return Id of the model with the smallest average error

The second method is called n-Centroid Pareto (see
Algorithm 6). For all Pareto points from the n-Pareto
Neighbourhood, the centroid Pareto point $c$ is calculated
according to the formula:

$$c = (d_c, e_c) = \left( \frac{\sum_{p \in n\text{-}PN} d_p}{|n\text{-}PN|},\ \frac{\sum_{p \in n\text{-}PN} e_p}{|n\text{-}PN|} \right), \tag{10}$$

where $d_c$ is the average of distances and $e_c$ is the average of
model errors for all Pareto points from the neighbourhood
(n-PN). In the next step the Euclidean distance between
each Pareto point and the centroid is computed. The model
associated with the Pareto point whose
Euclidean distance to the centroid is minimal is selected.
We call this method n-Centroid Pareto Model Identification
(n-CPMI). According to the definition, both n-APMI
and n-CPMI are partitioning models that split the chemical
space into disjoint groups and allow unambiguous model
identification.



Algorithm 6 Centroid Pareto
FIND-MODEL-ID($\Gamma$, $T$, $n$, $\tau$)
1: n-PN $\leftarrow$ n-Pareto neighbourhood for a given $n$ and the threshold $\tau$
2: for all points from n-PN calculate the centroid $c$
3: for each point from n-PN calculate the Euclidean distance to the centroid
4: return Id of the model having the Pareto point with the smallest distance to the centroid.
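
The two selection rules of Algorithms 5 and 6 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the neighbourhood is passed in already computed, each point carries `(distance, error, compound index, model index)`, and `errors` is a hypothetical lookup from compound index to the list of errors of all models.

```python
from math import hypot
from typing import Dict, List, Sequence, Tuple

# (distance to q, model error, compound index, model index)
Point = Tuple[float, float, int, int]


def average_pareto_model(neighbourhood: Sequence[Point], errors: Dict[int, List[float]]) -> int:
    """n-APMI: model with the smallest mean error over compounds in the neighbourhood."""
    if not neighbourhood:
        raise ValueError("empty Pareto neighbourhood")
    compounds = sorted({p[2] for p in neighbourhood})
    n_models = len(errors[compounds[0]])
    averages = [sum(errors[c][j] for c in compounds) / len(compounds) for j in range(n_models)]
    return min(range(n_models), key=lambda j: averages[j])


def centroid_pareto_model(neighbourhood: Sequence[Point]) -> int:
    """n-CPMI: model of the Pareto point closest to the centroid of Eq. (10)."""
    if not neighbourhood:
        raise ValueError("empty Pareto neighbourhood")
    size = len(neighbourhood)
    d_c = sum(p[0] for p in neighbourhood) / size
    e_c = sum(p[1] for p in neighbourhood) / size
    nearest = min(neighbourhood, key=lambda p: hypot(p[0] - d_c, p[1] - e_c))
    return nearest[3]
```

The two rules need not agree. For the neighbourhood `[(0.1, 0.4, 0, 1), (0.3, 0.2, 1, 1), (0.6, 0.05, 2, 0)]` with errors `{0: [0.5, 0.4], 1: [0.4, 0.2], 2: [0.05, 0.6]}`, the averaging rule selects model 0, whereas the centroid rule selects model 1 because the middle point lies closest to the centroid.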



We mentioned above that similar chemical compounds
might have very different measurements of activity. To
demonstrate this, we analysed the TETRATOX [36]
dataset, which contains growth inhibition concentrations
(IGC50) for Tetrahymena pyriformis. Chemical compounds
were compared in pairs. Their Tanimoto similarity
coefficients and differences in measured activity were
collected. Summarised results are displayed in Table 1.
Column headers hold differences in the measured activity
between two chemicals, while row headers describe the
molecule similarity threshold. Each cell of this array
gives the number of pairs of chemical compounds for
which the distance does not exceed the row identifier and
the difference in activity does not exceed the column
identifier.
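
The pair counting behind Table 1 can be sketched as follows. The input layout `(fingerprint, activity)` and the use of inclusive thresholds are assumptions on our part; inclusive bounds are chosen because the table reports a non-zero count in the cell for threshold 0 on both axes.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Compound = Tuple[Sequence[int], float]  # hypothetical layout: (fingerprint bits, measured activity)


def tanimoto_distance(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """1 - Tanimoto similarity of two equal-length binary fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return 1.0 - (both / either if either else 1.0)


def pair_count_table(compounds: Sequence[Compound],
                     distance_thresholds: Sequence[float],
                     activity_thresholds: Sequence[float]) -> List[List[int]]:
    """Cell [r][c]: pairs with distance <= row threshold and activity difference <= column threshold."""
    table = [[0] * len(activity_thresholds) for _ in distance_thresholds]
    for (fp_a, act_a), (fp_b, act_b) in combinations(compounds, 2):
        d = tanimoto_distance(fp_a, fp_b)
        diff = abs(act_a - act_b)
        for r, d_max in enumerate(distance_thresholds):
            if d > d_max:
                continue
            for c, a_max in enumerate(activity_thresholds):
                if diff <= a_max:
                    table[r][c] += 1
    return table
```

Counts are cumulative along both axes, which matches the monotone growth of the values in each row and column of Table 1.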

The TETRATOX dataset contains over one thousand
chemical compounds and the biggest difference between
measured values of IGC50 is equal to 5.3. Notice that
the number of pairs of chemicals that are similar, based
on both the fingerprint similarity and the activity, is very
small. There is only one pair of chemical compounds
that have the same activity and maximal similarity (first
row, first column). On the other hand, there are many
chemicals which are similar fingerprint-wise but have
different activities. This makes the model identification
challenging.

In the next section we present results of the experiments 
that were carried out in order to demonstrate how model 
identification works. 

Experimental results 

Two experiments were carried out in order to demonstrate
the advantages of model identification for predictive toxicology.
Each experiment has two phases. In the first phase
we treated model identification as a classification problem
to study the performance of the proposed methods in
comparison with other classification algorithms. We
defined an "oracle model" that associates each chemical
compound from a given chemical space with the most predictive
model from the collection of existing models and
we used this model to validate our methods. In the second
phase, for each chemical compound we applied an
identified model to predict the growth inhibition concentration
(IGC50) in the first experiment and the partition coefficient
(LogP) in the second. Finally, we compared these
results with the original model performances applied to
the whole chemical space.

IGC50 Prediction for Tetrahymena Pyriformis 

A dataset (Tetrahymena Pyriformis Toxicity - TPT) of
1129 chemicals was obtained from the INCHEMICOTOX
webpage [37]. This dataset comprises toxicity data for
the unicellular ciliated protozoan Tetrahymena pyriformis
(see [38]) and was published in [39]. The measure of toxicity
is the 50% growth inhibition concentration (IGC50). Two
QSAR regression models were obtained from INCHEMICOTOX.
These models are also reported in the JRC QSAR
Models Database. The first, the non-polar narcosis (NPN)
QSAR [40], was originally trained on 87 chemicals identified
as non-polar narcotics with $q^2 = 0.95$. The linear
regression model was defined as follows:

$$\log(1/\mathrm{IGC50}) = 0.83 \log P - 2.07,$$

where $\log P$ is the octanol-water partition coefficient.
The second, the polar narcosis (PN) QSAR model [41] for
Tetrahymena pyriformis, was trained on 138 polar narcotic
chemicals with $q^2 = 0.75$ and defined as follows:

$$\log(1/\mathrm{IGC50}) = 0.62 \log P - 1.00.$$

Table 1 Analysis of chemical compound similarities in order to highlight the difference of the chemical activity for the
TETRATOX dataset

fsim/diffactiv  0     0.1    0.2    0.3    0.4    0.5    0.6    0.7
0               1     2      2      2      2      2      2      2
0.1             3     13     27     44     51     62     70     79
0.2             6     112    220    335    431    512    585    655
0.3             16    318    617    933    1213   1474   1719   1928
0.4             32    720    1402   2081   2701   3297   3840   4328
0.5             66    1380   2726   4042   5227   6437   7536   8547

fsim/diffactiv  0.8   0.9    1      1.1    1.2    1.3    1.4    1.5
0               2     2      2      2      2      2      2      2
0.1             84    90     93     96     99     103    104    104
0.2             700   753    782    801    827    842    849    858
0.3             2106  2278   2412   2507   2621   2715   2784   2821
0.4             4763  5160   5526   5837   6119   6360   6575   6724
0.5             9481  10362  11167  11840  12488  13082  13589  14004

Training datasets for both models were obtained from the JRC
QSAR Models Database. These datasets were compared
with the Tetrahymena pyriformis dataset and 204 training
chemicals (136 from the PN model and 68 from the NPN model)
were present in the TPT dataset. We did
not perform any data curation for this dataset. The
models described above were implemented using logP values
calculated with the CDK library [42] and used to predict
toxicity for the TPT dataset.
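
The two linear QSAR models quoted above are simple enough to state directly in code. The sketch below only restates the published coefficients; the function names are ours, and the logP value in the usage note is back-calculated for illustration rather than taken from the paper.

```python
from typing import Dict


def npn_log_inv_igc50(log_p: float) -> float:
    """Non-polar narcosis model: log(1/IGC50) = 0.83 logP - 2.07."""
    return 0.83 * log_p - 2.07


def pn_log_inv_igc50(log_p: float) -> float:
    """Polar narcosis model: log(1/IGC50) = 0.62 logP - 1.00."""
    return 0.62 * log_p - 1.00


def predict_both(log_p: float) -> Dict[str, float]:
    """Predictions of both models for one compound, keyed by model name."""
    return {"NPN": npn_log_inv_igc50(log_p), "PN": pn_log_inv_igc50(log_p)}
```

A logP of about 1.08 gives roughly -1.172 for the NPN model and -0.329 for the PN model, which is consistent with the mlrNPN and mlrPN entries of the first row in Figure 3.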

First, we considered the model identification problem
as a classification problem: predicting which model will be
the most reliable for a given chemical compound. Having
a dataset of the IGC50 values predicted by both models
and the measured value, we used a priori information
("oracle model") about the best model for each
chemical compound and applied various classification
methods. To simulate model identification for previously
unseen chemical compounds, the leave-one-out (LOO)
method was used. This method takes one chemical
compound out of the dataset and uses the other chemicals to
predict which model would be the most reliable for it. This
procedure was repeated for all chemicals in the dataset.
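
The oracle labelling and the leave-one-out loop described above can be sketched as follows. This is an illustrative Python outline, not the authors' R code: the helper names, the single numeric feature and the toy 1-nearest-neighbour classifier are assumptions made only to keep the example self-contained.

```python
from typing import Callable, Dict, List


def oracle_label(measured: float, predictions: Dict[str, float]) -> str:
    """Name of the model whose prediction is closest to the measured value."""
    return min(predictions, key=lambda m: abs(predictions[m] - measured))


def nearest_neighbour(train_x: List[float], train_y: List[str], query: float) -> str:
    """Toy 1-NN classifier on one numeric feature (stand-in for a real classifier)."""
    best = min(range(len(train_x)), key=lambda k: abs(train_x[k] - query))
    return train_y[best]


def leave_one_out(features: List[float], labels: List[str],
                  classify: Callable[[List[float], List[str], float], str]) -> int:
    """Count how many held-out items get the same label as the oracle."""
    correct = 0
    for i in range(len(features)):
        # Hold out item i and classify it from all remaining items.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct


if __name__ == "__main__":
    # Hypothetical feature values and oracle labels, for illustration only.
    xs = [0.5, 0.7, 3.0, 3.2]
    ys = ["PN", "PN", "NPN", "NPN"]
    print(leave_one_out(xs, ys, nearest_neighbour), "of", len(xs), "correct")
```

Any classifier with the same three-argument signature can be passed to `leave_one_out`, which mirrors how the different methods are compared on equal terms.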

Table 2 compares the n-CPMI and n-APMI methods proposed
in this paper with DMS (the Double Min Score algorithm)
[28] and with standard classification algorithms
implemented in WEKA [43]: NaiveBayes, BayesNet,
decision trees (PART and J48), nearest neighbour (IBK) and
support vector machine (SMO). These classifiers were
initialised with their default parameter settings. The dataset
used to generate these classification models consisted of
chemicals represented by binary descriptors (1024-bit
fingerprints calculated using the cdk library) and the model
errors. We compared all classifiers according to the number
of correctly classified chemicals and the classifier
accuracies. The 3-APMI method gives the highest number of
correctly classified elements and relatively low numbers of
false positives and false negatives, especially when
compared with


Table 2 Comparison of classification algorithms according
to a number of correctly classified elements, false positive,
false negative and the classifiers accuracies

Method        Correct class   False positive   False negative   Accuracy
SMO           899             122 (10.8%)      106 (9.4%)       0.80
PART          904             123 (10.9%)      101 (8.9%)       0.80
NaiveBayes    845             191 (19%)        90 (7.9%)        0.75
J48           905             123 (10.9%)      100 (8.9%)       0.80
IBK(1)        905             121 (10.7%)      102 (9%)         0.80
IBK(3)        901             133 (11.7%)      94 (8.3%)        0.79
IBK(5)        889             149 (13.2%)      93 (8.2%)        0.78
BayesNet      756             264 (23%)        108 (9.5%)       0.67
DMS           901             115 (10.1%)      112 (9.9%)       0.79
3-CPMI        902             136 (12%)        90 (7.9%)        0.79
5-CPMI        897             137 (12%)        94 (8.3%)        0.79
10-CPMI       863             168 (14.8%)      97 (8.5%)        0.76
3-APMI        918             99 (8.7%)        111 (9.8%)       0.81
5-APMI        891             115 (10%)        122 (10.8%)      0.78

The polar narcosis model label was defined as the positive class.
IBK(3). The 3-APMI uses the 3-Pareto neighbourhood,
whereas IBK(3) uses the 3-nearest neighbourhood for
classification. This shows that model identification
using Pareto points is as good as, or can be better than,
other well-known classification algorithms.

The decision on model identification relies on the
distance to the Pareto points. Figures 4 and 5 show
misclassification examples for the 3-APMI method. In Figure 5,
the NPN model was identified for 3-Phenyl-1-propanol.
Its Pareto neighbourhood included three chemicals:
4-Chloro-3-methylphenol, Methylbenzene and
4-Dimethylbenzene, with the distances and model errors
shown in Table 3. The 3-APMI method averages the model
errors over all Pareto points in this neighbourhood and
selects the model with the smallest average, in this case
the NPN model. One can notice that the best model
for this Pareto neighbourhood is the NPN model for
4-Dimethylbenzene, although this chemical compound is not
the most similar to the query chemical compound.

To demonstrate a correct classification example, we
selected Benzylamine, which was correctly associated with
the PN model. Its Pareto neighbourhood included
two chemicals: 2-Chloroaniline and (+/-)-1,2-Diphenyl-2-propanol,
with distances and model performances shown
in Table 4 (notice that, according to Definition 4, the
3-Pareto neighbourhood consists of at most three Pareto
points). The distances to the query chemical compound
are small and for both chemicals the PN model gives the
most reliable prediction. The 3-APMI method identifies the PN
model, which has the minimal average error over all Pareto
neighbours.
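
The averaging step in these two examples can be checked numerically. The sketch below is an illustration in Python, assuming that the PN and NPN columns of Tables 3 and 4 hold the model errors at each Pareto point; the function name `apmi_select` is hypothetical.

```python
from statistics import mean


def apmi_select(neighbourhood):
    """Average each model's error over the Pareto neighbourhood and
    return (name of the model with the smallest average, all averages)."""
    models = list(neighbourhood[0].keys())
    averages = {m: mean(point[m] for point in neighbourhood) for m in models}
    return min(averages, key=averages.get), averages


# Model errors from Table 3 (query: 3-Phenyl-1-propanol).
table3 = [
    {"PN": 0.37, "NPN": 0.28},   # Methylbenzene
    {"PN": 0.54, "NPN": 0.08},   # 4-Dimethylbenzene
    {"PN": 0.61, "NPN": 1.14},   # 4-Chloro-3-methylphenol
]

# Model errors from Table 4 (query: Benzylamine).
table4 = [
    {"PN": 0.30, "NPN": 0.38},    # 2-Chloroaniline
    {"PN": 0.041, "NPN": 0.59},   # (+/-)-1,2-Diphenyl-2-propanol
]

if __name__ == "__main__":
    print(apmi_select(table3))
    print(apmi_select(table4))
```

For Table 3 the two averages are very close (about 0.507 for PN against 0.50 for NPN), which shows how a single Pareto point with a very small error can tip the decision towards NPN; for Table 4 the PN average is clearly smaller.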



Name                     CAS        Model   Oracle
sec-Phenethyl alcohol    98-85-1    PN      NPN
Benzyl chloride          100-44-7   PN      NPN
Benzene                  71-43-2    PN      NPN
Ethoxybenzene            103-73-1   PN      NPN
Chlorobenzene            108-90-7   PN      NPN

Figure 4 Chemical compounds wrongly associated with the PN
model by 3-APMI.



Name                            CAS         Model   Oracle
3-Phenyl-1-propanol             122-97-4    NPN     PN
4-Ethylbenzylalcohol            768-59-2    NPN     PN
Methylbenzene                   108-88-3    NPN     PN
2-Amino-5-chlorobenzonitrile    5922-60-1   NPN     PN
4-Biphenylmethanol              3597-91-9   NPN     PN

Figure 5 Chemical compounds wrongly associated with the NPN
model by 3-APMI.



Additionally, from the entire TPT dataset, we selected the
chemicals included in the original training datasets of both
models. We identified 4 out of 68 chemicals that
were used to train the NPN model but that the oracle model
associated with the PN model (see Figure 6). The
same analysis was repeated for the training dataset of the
PN model, and we identified 9 out of 136 chemicals that
were associated with the NPN model by the oracle model
(see Figure 7).

To predict IGC50 for the TPT dataset we used the 
identified model for each chemical compound in this 
dataset. The results obtained for the entire dataset are 
shown in Table 5. The statistics used are: R2 - correlation 
coefficient for the observed and predicted values, RSE - 
root-squared error, Q2 - predictive squared correlation 
coefficient, MAE - mean absolute error and RMSE - root 
mean square error. The "oracle model" has the knowledge 



Table 3 Model performances and distance comparison of
the 3-Pareto neighbourhood of the 3-Phenyl-1-propanol

Name                       Distance   PN     NPN
Methylbenzene              0.33       0.37   0.28
4-Dimethylbenzene          0.36       0.54   0.08
4-Chloro-3-methylphenol    0.30       0.61   1.14

Table 4 Model performances and distance comparison of
the 3-Pareto neighbourhood of the Benzylamine

Name                             Distance   PN      NPN
2-Chloroaniline                  0.08       0.30    0.38
(+/-)-1,2-Diphenyl-2-propanol    0.11       0.041   0.59



of the best model for each chemical compound. Its
predictivity is low because we used only two existing models
from the JRC QSAR database, which were designed on the basis
of mode of action (polar/non-polar narcosis), for chemicals
from TPT.

The 3-APMI method provides the best prediction
among the "non-oracle models". The first two rows present
prediction statistics for the NPN and PN models. They are
lower than for all other models. Notice, however, that their
R2 and RSE statistics are identical. This is due to the fact
that both models are affine functions of one and the same
explanatory variable. An affine function can, therefore,
transform one model into the other. This is what happens
when regression is applied to compute R2 and RSE. Notice
that the other measures, Q2 and the prediction errors
(MAE and RMSE), are different for these models.
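
The invariance of R2 under affine transformations can be illustrated with a short numerical check. The Python sketch below uses made-up logP and observed toxicity values (not data from the paper), computes R2 as the squared Pearson correlation, and assumes the common definition Q2 = 1 - SS_res/SS_tot with the mean of the observed values; the exact Q2 variant used in the study is not stated in this section.

```python
def pearson_r2(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return sxy * sxy / (sxx * syy)


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def q2(obs, pred):
    """Assumed definition: 1 - SS_res / SS_tot around the observed mean."""
    mo = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mo) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
log_p = [0.5, 1.2, 2.0, 2.9, 3.7]
observed = [-0.9, -0.3, 0.4, 0.7, 1.5]

pred_npn = [0.83 * x - 2.07 for x in log_p]
pred_pn = [0.62 * x - 1.00 for x in log_p]

if __name__ == "__main__":
    print("R2 :", pearson_r2(observed, pred_npn), pearson_r2(observed, pred_pn))
    print("Q2 :", q2(observed, pred_npn), q2(observed, pred_pn))
    print("MAE:", mae(observed, pred_npn), mae(observed, pred_pn))
```

Because each prediction vector is an affine image of the same logP vector, the correlation with the observations is the same for both models, while Q2 and MAE depend on the actual slope and intercept.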

As another example, we considered only a small sub- 
set of the whole initial TPT dataset that contains only 
376 chemical compounds. This dataset includes all train- 
ing chemicals used in PN and NPN models plus over 100 
additional chemicals from the TPT dataset. We included 
chemicals for which the absolute error of the oracle model 



Name                 CAS         Log.l...   Oracle   Model
1,2-Decanediol       1119-86-4   0.764      PN       NPN
1-Dodecylalcohol     112-53-8    2.161      PN       NPN
Tridecylalcohol      112-70-9    2.45       PN       NPN
Di-n-hexyl ketone    462-18-0    1.521      PN       NPN

Figure 6 Chemical compounds that were originally used to train the NPN model but associated with the PN model by the oracle.

Name                      CAS        Log.l...   Oracle   Model
4-tert-Butylphenol        98-54-4    0.91       NPN      NPN
2,5-Dimethylphenol        95-37-4    0.08       NPN      PN
3,4-Dimethylphenol        95-65-8    0.12       NPN      PN
2-Methylphenol            95-48-7    -0.29      NPN      PN
2-Ethoxyphenol            94-71-3    -0.36      NPN      PN
4-tert-Pentylphenol       80-45-6    1.23       NPN      NPN
2,3,5-Trimethylphenol     697-82-5   0.36       NPN      NPN
Salicylic acid            69-72-7    -0.51      NPN      NPN
3-Hydroxybenzylalcohol    620-24-6   -1.04      NPN      NPN

Figure 7 Chemical compounds that were originally used to train the PN model but associated with the NPN model by the oracle.

Table 5 Analysis of model prediction accuracies for IGC50
for Tetrahymena pyriformis

Method name   R2     RSE    Q2     MAE    RMSE
NPN           0.58   0.66   0.15   0.69   0.94
PN            0.58   0.66   0.58   0.50   0.66
DMS           0.68   0.56   0.62   0.43   0.62
3-CPMI        0.67   0.58   0.60   0.43   0.63
5-CPMI        0.66   0.59   0.59   0.44   0.65
10-CPMI       0.65   0.60   0.57   0.44   0.66
3-APMI        0.69   0.56   0.65   0.41   0.60
5-APMI        0.68   0.57   0.62   0.42   0.62
Oracle        0.75   0.50   0.71   0.35   0.54



is less than 0.4 and which are in the applicability domain of
both models. The value of logP is in the range [-0.5, 6.2] and
the toxicity value is in the range [-2.5, 3.05]. Again we compared
various classifiers that were used for model identification
(see Table 6).

In this case, the best method is 3-CPMI, which selects from
the 3-Pareto neighbourhood the model whose Pareto
point is closest to the neighbourhood centroid. This
method gives better results than the DMS
method, which selects the model with the smallest error
for the nearest neighbour. Tables 7 and 8 list
the chemicals that were wrongly classified by the 3-CPMI
algorithm. Comparing the regression models for IGC50



Table 6 Comparison of classification algorithms according
to a number of correctly classified elements, false positive,
false negative and the classifiers accuracies

Method        Correct class   False positive   False negative   Accuracy
SMO           296             47 (12%)         33 (8.7%)        0.787
PART          303             34 (9%)          39 (10.3%)       0.805
NaiveBayes    281             67 (17%)         28 (7.4%)        0.747
J48           296             44 (11.7%)       36 (9.5%)        0.787
IBK(1)        307             42 (11.1%)       27 (7.1%)        0.816
IBK(3)        300             42 (11.1%)       34 (9%)          0.797
IBK(5)        299             46 (12.2%)       31 (8.2%)        0.795
BayesNet      273             76 (20.1%)       27 (7.1%)        0.726
DMS           297             48 (12.7%)       31 (8.2%)        0.719
3-CPMI        316             29 (7.7%)        31 (8.2%)        0.844
5-CPMI        305             33 (8.7%)        38 (10.1%)       0.811
10-CPMI       288             41 (10.9%)       47 (12.5%)       0.766
3-APMI        306             33 (8.7%)        37 (9.8%)        0.813
5-APMI        300             41 (10.9%)       35 (9.3%)        0.797

The polar narcosis model label was defined as the positive class.



Table 7 Chemical structures wrongly associated with the
PN model by 3-CPMI

CAS        Smiles
4097498    CC(C)(C)C1=CC(=C(C(=C1)[N+](=O)[O-])O)[N+](=O)[O-]
6920225    CCCCC(O)CO
928972     CCC=CCCO
10031875   CCC(CC)COC(=O)C
112141     C(C)(=O)OCCCCCCCC
105668     C(CCC)(=O)OCCC
624544     O(C(CC)=O)CCCCC
123660     C(CCCCC)(=O)OCC
123159     CCCC(C=O)C
2987168    CC(C)(CC=O)C
96480      OC1CCCO1
19686738   CC(CBr)O
4620706    C(NCCO)(C)(C)C
111864     CCCCCCCCN
597977     C(N=C=S)(C)(C)CC
17112822   c1c2c(CN=C=S)cccc2ccc1
1138529    CC(C)(C)C1=CC(=CC(=C1)O)C(C)(C)C
142303     C(#CC(C)(C)O)C(C)(C)O
31333138   CCCCCC#CCCO
107879     CC(CCC)=O
2067336    OC(CCCCBr)=O
91156      N#Cc1c(C#N)cccc1
2065238    c1(ccccc1)OCC(OC)=O
613978     N(CC)(C)c1ccccc1
586787     [N+](c1ccc(cc1)Br)(=O)[O-]
91667      c1(N(CC)CC)ccccc1
38713563   O(CCCCCCCCC)C(=O)c1ccc(O)cc1
622468     C(Oc1ccccc1)(=O)N
93914      C(CC(=O)C)(=O)c1ccccc1
2216946    C(#Cc1ccccc1)C(=O)OCC



(see Table 9), the 3-CPMI method provides better prediction
than the DMS, PN and NPN models.
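
The difference between the DMS and CPMI selection rules, as they are described in one sentence each above, can be sketched as follows. This Python outline is only one possible reading of those descriptions: it assumes that each Pareto point is a (distance, error) pair labelled with the model that produced the error, and that closeness to the centroid is measured with the Euclidean distance. The function names and the numbers are hypothetical.

```python
from math import hypot


def dms_select(neighbours):
    """neighbours: list of (distance, {model: error}).
    Take the nearest neighbour and return its smallest-error model."""
    _, errors = min(neighbours, key=lambda item: item[0])
    return min(errors, key=errors.get)


def cpmi_select(pareto_points):
    """pareto_points: list of (distance, error, model).
    Return the model of the Pareto point closest to the centroid."""
    n = len(pareto_points)
    cx = sum(p[0] for p in pareto_points) / n
    cy = sum(p[1] for p in pareto_points) / n
    closest = min(pareto_points, key=lambda p: hypot(p[0] - cx, p[1] - cy))
    return closest[2]


# Hypothetical neighbourhood, for illustration only.
neighbours = [
    (0.10, {"PN": 0.40, "NPN": 0.55}),
    (0.20, {"PN": 0.60, "NPN": 0.20}),
    (0.30, {"PN": 0.50, "NPN": 0.05}),
]
pareto_points = [
    (0.10, 0.40, "PN"),
    (0.20, 0.20, "NPN"),
    (0.30, 0.05, "NPN"),
]

if __name__ == "__main__":
    print("DMS :", dms_select(neighbours))
    print("CPMI:", cpmi_select(pareto_points))
```

In this toy neighbourhood DMS follows the single most similar chemical, whereas the centroid rule is pulled towards the points that balance similarity and error, so the two rules can disagree.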

The above examples show the great potential of
model identification methods. We demonstrated that a
method based on pre-defined rules (such as maximal
similarity for chemicals and minimal error for the model
assigned to them) is comparable with standard
machine learning algorithms for the classification
problem. Model identification can be considered as an
ensemble technique for building highly predictive consensus
models in predictive toxicology.

Table 8 Chemical structures wrongly associated with the
NPN model by 3-CPMI

CAS        Smiles
29338496   CC(C(C1=CC=CC=C1)C2=CC=CC=C2)O
100447     C1=CC=C(C=C1)CCl
1823912    CC(C#N)C1=CC=CC=C1
103695     CCNC1=CC=CC=C1
112538     C(CCCCCCCCCC)CO
1119864    C(CCCCCC)CC(CO)O
628637     C(C)(=O)OCCCCC
108225     O(C(=C)C)C(=O)C
94042      C(C(OC=C)=O)(CCCC)CC
1932929    C(CC)(=O)OCC#C
1732098    O(C(CCCCCCC(OC)=O)=O)C
110623     C(CCCC)=O
36536466   O=C1CC(C)O1
6261229    CCC#CCO
4753597    O(CCCCBr)C(C)=O
20965279   N#CCCCCCCBr
1577180    OC(=O)CC=CCC
111160     C(CCCCCC(=O)O)(=O)O
535137     C(C(C)Cl)(=O)OCC
600000     CCOC(=O)C(C)(C)Br
23165448   c1ccc(CCCC)cc1N=C=S
1565759    CCC(C)(C1=CC=CC=C1)O
529191     CC1=CC=CC=C1C#N
141286     C(CCCCC(OCC)=O)(OCC)=O
106796     C(CCCCCCCCC(OC)=O)(OC)=O
123728     C(CCC)=O
22819916   N#CCCCCCCCl
109524     C(CCCC)(=O)O
2627272    c1ccccc1CCCN=C=S
609938     c1(c(c([N+](=O)[O-])cc(c1)C)O)[N+](=O)[O-]
3012371    C(#N)SCc1ccccc1

Table 9 Analysis of model prediction accuracies for IGC50
for the reduced TPT dataset

Method name   R2     RSE    Q2     MAE    RMSE
NPN           0.84   0.37   0.60   0.44   0.57
PN            0.84   0.37   0.75   0.33   0.46
DMS           0.89   0.30   0.88   0.20   0.32
3-CPMI        0.92   0.25   0.91   0.16   0.26
5-CPMI        0.90   0.28   0.89   0.18   0.29
10-CPMI       0.88   0.32   0.86   0.21   0.33
3-APMI        0.91   0.27   0.90   0.18   0.29
5-APMI        0.90   0.28   0.89   0.19   0.30
Oracle        0.98   0.10   0.98   0.09   0.11




LogP prediction for in-house Syngenta dataset 

For the second experiment we considered the estimation
of LogP for an internal Syngenta dataset. The
octanol/water partition coefficient (LogP) is a measure of
the lipophilicity of chemical compounds and is an important
descriptive parameter in bio-studies [8]. Currently,
there are various methods for estimating this coefficient:
fragmental methods (CLOGP, KOWWIN), atom
contribution methods (TSAR, XLOGP), topological indices
(MLOGP) and molecular properties (BLOGP).

The initial dataset contains about 9000 chemical compounds
and their LogP values measured in Syngenta's
laboratories. The measured value of LogP is in the
range [-5.08, 8.65] (see Figures 8 and 9). No data curation
was performed beyond that provided by
Syngenta researchers. Three models to predict LogP were
applied to this dataset: CLOGP developed in Syngenta,
KOWWIN in EPI Suite and MLOGP in Dragon. We
randomly selected 1000 chemicals (out of 9000) and used
the remaining 8000 chemicals as the chemical space of the
partitioning model. We used the 3-APMI method as it was
the best method in the first experiment. We compared the
performance of these four models on the 1000 selected
chemicals (see Table 10). We repeated the same experiment
with 2000 randomly selected chemicals. Additionally,
we selected from the initial dataset those chemical
compounds for which the oracle model has an absolute error > 0.7.
We obtained a set of 2333 chemical compounds.

Table 10 displays the accuracy of model predictions. The 
3-APMI is generally at least as good as the best model 
(CLOGP). In the case of randomly selected chemicals 




Figure 8 Syngenta measured LogP dataset (horizontal axis: Index).

Figure 9 Summary of Syngenta measured LogP dataset (median = 2.11; 25th quantile = 0.99; 75th quantile = 3.21).



CLOGP was hard to beat, although for 2000 randomly
selected chemicals one can clearly see the benefit of using
3-APMI (higher Q2 and lower MAE). The biggest gain is,
however, observed for those chemicals whose activity is
difficult to predict (the last experiment). This shows that
the partitioning model (3-APMI) can be a powerful
knowledge extraction tool.

All methods proposed in the paper were implemented in
R [44]. The logP value, fingerprints and Tanimoto similarity
were calculated using the RCDK [45] library. A number
of tests were run to define the threshold τ. It is important
to notice that the n-Pareto neighbourhood defines



Table 10 Analysis of model prediction accuracies for a
LogP estimation

nr chemicals   Mod. Name   Q2     MAE    RMSE
1000           CLOGP       0.83   0.38   0.74
               MLOGP       0.57   0.84   1.19
               KOWWIN      0.79   0.47   0.83
               3-APMI      0.84   0.38   0.74
2000           CLOGP       0.76   0.41   0.78
               MLOGP       0.44   0.85   1.2
               KOWWIN      0.69   0.50   0.88
               3-APMI      0.78   0.39   0.72
2333           CLOGP       0.37   1.21   1.54
               MLOGP       0.39   1.13   1.52
               KOWWIN      0.41   1.01   1.49
               3-APMI      0.64   0.80   1.16



the set of at most n Pareto points. Therefore, for the
3-Pareto neighbourhood we found chemicals that have 1, 2,
or 3 Pareto neighbours for τ = 0.4 for the entire TPT
dataset. For the 5-Pareto neighbourhood τ = 0.7, and
for the 10-Pareto neighbourhood we considered all Pareto
neighbours. This shows that the size of the Pareto
neighbourhood depends on the size of the available chemical
space and may vary for different endpoints. Also, looking
at the results for APMI and CPMI, one can notice that it
is not worth considering all Pareto points, and that the
size of the Pareto neighbourhood depends on chemical
compound similarities.
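
The ingredients mentioned in this paragraph (fingerprint similarity, Pareto points in two dimensions and a neighbourhood capped at n points) can be sketched in a few lines. The Python code below is a generic illustration rather than the R implementation used in the study: fingerprints are modelled as sets of "on" bit positions, both coordinates of a point are minimised, and the optional `max_distance` cut-off is only one hypothetical way in which a threshold such as τ might be applied.

```python
def tanimoto_distance(a: set, b: set) -> float:
    """1 - Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def pareto_minimal(points):
    """Pareto-minimal subset of 2D points (smaller is better on both axes)."""
    front = []
    best_y = float("inf")
    # After sorting by x (then y), a point is non-dominated exactly when
    # its y value is strictly smaller than every y seen so far.
    for x, y in sorted(points):
        if y < best_y:
            front.append((x, y))
            best_y = y
    return front


def n_pareto_neighbourhood(points, n, max_distance=None):
    """At most n Pareto points, optionally restricted by a distance cut-off
    (the cut-off semantics are an assumption, not taken from the paper)."""
    front = pareto_minimal(points)
    if max_distance is not None:
        front = [p for p in front if p[0] <= max_distance]
    return front[:n]


if __name__ == "__main__":
    # Hypothetical (distance, model error) pairs.
    pts = [(0.1, 0.9), (0.2, 0.5), (0.3, 0.7), (0.4, 0.2), (0.5, 0.3)]
    print(pareto_minimal(pts))
    print(n_pareto_neighbourhood(pts, n=3, max_distance=0.3))
    print(tanimoto_distance({1, 2, 3}, {2, 3, 4}))
```

With a cut-off the neighbourhood may contain fewer than n points, which is consistent with the observation above that chemicals can have 1, 2 or 3 Pareto neighbours in a 3-Pareto neighbourhood.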

Conclusion 

In this paper, we draw attention to the advantages of model
reuse in predictive toxicology. Since the amount of
experimental data and the number of predictive models 
are growing every day, it is crucial to develop automated 
methods for mining models in repositories. The most 
demanding task is to find a model for a new chemi- 
cal compound from a collection of models for a given 
endpoint. 

In this paper, we proposed two methods (APMI and 
CPMI) that identify the suitable model for a query chem- 
ical compound based on the model performances in its 
Pareto neighbourhood. These algorithms are based on 
our simple yet effective method for finding the Pareto set 
in 2D space. The experimental results demonstrate the 
advantage of our approach and indicate that automated 
model identification is a promising research direction 
with many practical applications. Our approach is mainly 
focused on regression models and in the future we plan 
to extend it to classification models, including the anal- 
ysis of model variables in chemical space partitioning. 
An additional interesting direction could address the esti- 
mation of identified model reliability for a new chemical 
compound. 

Appendix 1 Proofs 

Proof (Lemma 1). We prove this lemma by contradiction.
Let $j \in \{1, \ldots, K\}$ and choose $v \in \Gamma_j$. Assume that
$v \notin \Gamma$, which is equivalent to saying that there exists $w \in V$
that is strictly dominated by $v$, i.e. $w \prec v$. This means that
$\pi_j(w) = \pi_j(v)$ and $w \in V_j$. By the definition of $\Gamma_j$ we
know that $v$ is a minimal vector in $V_j$, so $v \preceq w$, which
contradicts $w \prec v$.



Proof (Lemma 2). Let $v \in V$. First notice that
$\pi_j(v) \geq f_j^{\min}$, $j = 1, \ldots, K$. If
$\pi_j(v) \notin (f_j^{\min}, f_j^{\max})$ for all $j$, then
$\pi_j(v) \geq f_j^{\max}$ for all $j$ and $w \preceq v$ for $w \in \Pi$.
If there exists exactly one $j \in \{1, \ldots, K\}$ such that
$\pi_j(v) \in (f_j^{\min}, f_j^{\max})$, then for each index $i \neq j$
we have $\pi_i(v) \geq f_i^{\max}$ and there exists a vector
$w \in V_j$ such that $w \preceq v$. Therefore, if $v$ is
incomparable with vectors in $\Pi$, none of the above cases
can take place, and the proof is completed.

Proof (Lemma 3). Notice first that each $\Gamma_j$, $j = 1, 2$,
consists of one element, because the Pareto order $\preceq$ induces
a linear order on the sets $V_j$. Therefore, $\Pi$ consists of
at most two elements. Assume that $\Pi$ has one element,
which we denote by $w$. From the construction of $\Pi$ we
have:

$$\pi_1(w) = f_1^{\min}, \qquad \pi_2(w) = f_2^{\min}.$$

Consequently, $w$ is dominated by every vector of $V$, so it is
the only minimal vector in $V$.

Assume now that $\Pi$ consists of two vectors: $w_1$ and $w_2$.

($\Rightarrow$) After renumbering, $\Gamma_1 = \{w_1\}$ and $\Gamma_2 = \{w_2\}$.
Hence, we obtain from equations (5)-(7)

$$f_1^{\max} = \pi_1(w_2), \qquad f_2^{\max} = \pi_2(w_1).$$

Due to (3) the set of vectors $v \in V$ incomparable with
$\Pi$ satisfies (9).

($\Leftarrow$) Let $v \in V$ be a vector for which inclusion (9) holds. Then,
using the renumbering of the sets $\Gamma_j$, $j = 1, 2$, from the above
implication, we obtain:

$$\pi_1(v) > f_1^{\min} = \pi_1(w_1), \qquad \pi_1(v) < f_1^{\max} = \pi_1(w_2),$$
$$\pi_2(v) < f_2^{\max} = \pi_2(w_1), \qquad \pi_2(v) > f_2^{\min} = \pi_2(w_2).$$

According to Definition 2 and formula (3) we obtain
$v \sim w_1$ and $v \sim w_2$. Since $\Pi = \{w_1, w_2\}$, $v$ is
incomparable with the vectors $w_1$ and $w_2$.